arXiv.org e-Print Archive

Non-parametric clustering over user features and latent behavioral functions with dual-view mixture models

Author: A Kumar
Alberto Lumbreras
Bertrand Jouve
CE Rasmussen
D Görür
DB Dahl
E Anderson
EV Bonilla
Julien Velcin
KW Cheung
M Pellegrini
M Plummer
Marie Guégan
MB Eisen
MPS Brown
P Pavlidis
RM Neal
T Kamishima
W Gilks
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2016

We present a dual-view mixture model to cluster users based on their features and latent behavioral functions. Every component of the mixture model represents a probability density over a feature view for observed user attributes and a behavior view for latent behavioral functions that are indirectly observed through user actions or behaviors. Our task is to infer the groups of users as well as their latent behavioral functions. We also propose a non-parametric version based on a Dirichlet Process to automatically infer the number of clusters. We test the properties and performance of the model on a synthetic dataset that represents the participation of users in the threads of an online forum. Experiments show that dual-view models outperform single-view ones when one of the views lacks information.
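
The abstract says only that the non-parametric version is "based on a Dirichlet Process". As a rough illustration of why such a prior leaves the number of clusters open, the sketch below draws cluster assignments from a Chinese restaurant process, a standard representation of the Dirichlet Process. It is not the authors' inference procedure; the function name `crp_assignments` and the concentration parameter `alpha` are assumptions made for this example.

```python
import random

def crp_assignments(n_users, alpha, seed=0):
    """Draw cluster labels for n_users from a Chinese restaurant process.

    Each new user joins existing cluster c with probability proportional
    to its current size, or opens a new cluster with probability
    proportional to alpha (hypothetical concentration parameter, > 0).
    Returns (assignments, counts).
    """
    rng = random.Random(seed)
    counts = []        # counts[c] = number of users currently in cluster c
    assignments = []   # assignments[i] = cluster label of user i
    for _ in range(n_users):
        weights = counts + [alpha]  # last slot stands for "new cluster"
        choice = rng.choices(range(len(weights)), weights=weights)[0]
        if choice == len(counts):
            counts.append(1)        # open a new cluster
        else:
            counts[choice] += 1     # join an existing cluster
        assignments.append(choice)
    return assignments, counts
```

The number of clusters, `len(counts)`, is an outcome of the draw rather than an input, and larger values of `alpha` tend to produce more clusters.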

Scientific Publications of the University of Toulouse II Le Mirail

Open Archive Toulouse Archive Ouverte

HAL

HAL-INSA Toulouse

Hal-Diderot

A Regression-based K nearest neighbor algorithm for gene function prediction from heterogeneous data

Author: A Enright
A Gavin
A Grigoriev
A Hoerl
AJ Dobson
EG WS Cleveland
G GH
GRG Lanckriet
H Ge
M Deng
M Eisen
M Fellenberg
MPS Brown
O Troyanskaya
P Liang
P Pavlidis
R Overbeek
R Tibshirani
Walter L Ruzzo
WS Noble
Y Zheng
Zizhen Yao
Publication venue: BioMed Central
Publication date: 01/01/2006

BACKGROUND: As a variety of functional genomic and proteomic techniques become available, there is an increasing need for functional analysis methodologies that integrate heterogeneous data sources. METHODS: In this paper, we address this issue by proposing a general framework for gene function prediction based on the k-nearest-neighbor (KNN) algorithm. The choice of KNN is motivated by its simplicity, flexibility to incorporate different data types and adaptability to irregular feature spaces. A weakness of traditional KNN methods, especially when handling heterogeneous data, is that performance is subject to the often ad hoc choice of similarity metric. To address this weakness, we apply regression methods to infer a similarity metric as a weighted combination of a set of base similarity measures, which helps to locate the neighbors that are most likely to be in the same class as the target gene. We also suggest a novel voting scheme to generate confidence scores that estimate the accuracy of predictions. The method gracefully extends to multi-way classification problems. RESULTS: We apply this technique to gene function prediction according to three well-known Escherichia coli classification schemes suggested by biologists, using information derived from microarray and genome sequencing data. We demonstrate that our algorithm dramatically outperforms the naive KNN methods and is competitive with support vector machine (SVM) algorithms for integrating heterogeneous data. We also show that by combining different data sources, prediction accuracy can improve significantly. CONCLUSION: Our extension of KNN with automatic feature weighting, multi-class prediction, and probabilistic inference enhances prediction accuracy significantly while remaining efficient, intuitive and flexible. This general framework can also be applied to similar classification problems involving heterogeneous datasets.
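
The idea of a similarity metric built as a weighted combination of base similarity measures can be sketched as follows. This is a minimal illustration, not the paper's method: the weights are assumed to be given (the paper infers them by regression), the confidence score is a simple vote share rather than the paper's voting scheme, and the names `combined_similarity` and `knn_predict` are hypothetical.

```python
from collections import defaultdict

def combined_similarity(sims, weights):
    """Weighted sum of base similarity scores for one gene pair."""
    return sum(w * s for w, s in zip(weights, sims))

def knn_predict(base_sims, labels, weights, k=3):
    """Predict a class for a target gene from its k most similar genes.

    base_sims[i] holds the base similarity scores between the target and
    training gene i (for example, one score per data source);
    labels[i] is the class of training gene i.
    Returns (predicted_label, confidence), where confidence is the share
    of neighbor similarity mass that voted for the winning class.
    """
    scored = [(combined_similarity(s, weights), lab)
              for s, lab in zip(base_sims, labels)]
    neighbors = sorted(scored, key=lambda t: t[0], reverse=True)[:k]
    votes = defaultdict(float)
    for sim, lab in neighbors:
        votes[lab] += max(sim, 0.0)  # negative similarity contributes no vote
    total = sum(votes.values())
    best = max(votes, key=votes.get)
    confidence = votes[best] / total if total > 0 else 0.0
    return best, confidence
```

For example, with two base measures weighted equally, `knn_predict([[0.9, 0.8], [0.1, 0.2], [0.7, 0.9], [0.2, 0.1]], ["A", "B", "A", "B"], [0.5, 0.5])` selects class "A", because the two most similar genes both carry that label.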

The Hong Kong Polytechnic University Pao Yue-kong Library

Unsupervised fuzzy pattern discovery in gene expression data

Author: A Ben-Dor
AKC Wong
Andrew KC Wong
C Creighton
E Chitsaz
E Domany
FD Smet
G Piatetsky-Shapiro
Gene PK Wu
J Yen
Keith CC Chan
L Liu
MB Eisen
MPS Brown
SC Madeira
TR Golub
U Alon
W Pedrycz
WH Au
Y Wang
Publication venue: BioMed Central
Publication date: 01/01/2011

2010-2011 > Academic research: refereed > Publication in refereed journal

PolyU Institutional Repository

Public Library of Science (PLOS)

Segmentation of Multi-Isotope Imaging Mass Spectrometry Data for Semi-Automatic Detection of Regions of Interest

Author: B Schölkopf
BE Boser
C Burges
C Cortes
C Lechene
C-W Hsu
CC Chang
Christoph W. Turck
Claude Lechene
D-S Zhang
E Frank
G Cohen
G McMahon
G Székely
I El-Naqa
I Guyon
J. Collin Poczatek
JA Nelder
M Steinhauser
MPS Brown
N Cristianini
NR Pal
Philipp Gormanns
RJ Zawadzki
S Hua
Simon Rogers
Stefan Reckow
U Kreßel
X Zhang
Publication venue: Public Library of Science
Publication date: 01/01/2012

Multi-isotope imaging mass spectrometry (MIMS) associates secondary ion mass spectrometry (SIMS) with detection of several atomic masses, the use of stable isotopes as labels, and affiliated quantitative image-analysis software. By associating image and measure, MIMS allows one to obtain quantitative information about biological processes in sub-cellular domains. MIMS can be applied to a wide range of biomedical problems, in particular metabolism and cell fate [1], [2], [3]. In order to obtain morphologically pertinent data from MIMS images, we have to define regions of interest (ROIs). ROIs are drawn by hand, a tedious and time-consuming process. We have developed and successfully applied a support vector machine (SVM) for segmentation of MIMS images that allows fast, semi-automatic boundary detection of regions of interest. Using the SVM, high-quality ROIs (as compared to an expert's manual delineation) were obtained for two types of images derived from unrelated data sets. This automation simplifies, accelerates and improves the post-processing analysis of MIMS images. This approach has been integrated into “Open MIMS,” an ImageJ plugin for comprehensive analysis of MIMS images that is available online at http://www.nrims.hms.harvard.edu/NRIMS_ImageJ.php

Harvard University - DASH

MPG.PuRe

Predicting residue-wise contact orders in proteins by support vector regression

Author: A Bairoch
AG Murzin
AR Kinjo
B Rost
CH Tsai
D Kihara
D Sarda
DT Jones
G Pollastri
GP Raghava
HM Berman
J Song
J Wang
Jiangning Song
JM Chandonia
Kevin Burrage
KW Plaxco
M Punta
MPS Brown
NP Prabhu
S Ahmad
S Hua
V Vapnik
W Kabsch
W Liu
X Wang
Z Yuan
Publication venue: BioMed Central
Publication date: 01/01/2006

BACKGROUND: The residue-wise contact order (RWCO) describes the sequence separations between a residue of interest and its contacting residues in a protein sequence. It is a new kind of one-dimensional protein structure that represents the extent of long-range contacts and is considered a generalization of contact order. Together with secondary structure, accessible surface area, the B factor, and contact number, RWCO provides comprehensive and indispensable information for reconstructing the three-dimensional structure of a protein from a set of one-dimensional structural properties. Accurately predicting RWCO values could have many important applications in protein three-dimensional structure prediction and protein folding rate prediction, and give deep insights into protein sequence-structure relationships. RESULTS: We developed a novel approach to predict residue-wise contact order values in proteins based on support vector regression (SVR), starting from primary amino acid sequences. We explored seven different sequence encoding schemes to examine their effects on the prediction performance, including local sequence in the form of PSI-BLAST profiles, local sequence plus amino acid composition, local sequence plus molecular weight, local sequence plus secondary structure predicted by PSIPRED, local sequence plus molecular weight and amino acid composition, local sequence plus molecular weight and predicted secondary structure, and local sequence plus molecular weight, amino acid composition and predicted secondary structure. When using local sequences with multiple sequence alignments in the form of PSI-BLAST profiles, we could predict the RWCO distribution with a Pearson correlation coefficient (CC) between the predicted and observed RWCO values of 0.55, and root mean square error (RMSE) of 0.82, based on a well-defined dataset with 680 protein sequences. Moreover, by incorporating global features such as molecular weight and amino acid composition, we could further improve the prediction performance to a CC of 0.57 and an RMSE of 0.79. In addition, adding the secondary structure predicted by PSIPRED was found to significantly improve the prediction performance and yielded the best prediction accuracy, with a CC of 0.60 and an RMSE of 0.78, which is at least comparable to other existing methods. CONCLUSION: The SVR method shows a prediction performance competitive with or at least comparable to the previously developed linear regression-based methods for predicting RWCO values. In contrast to support vector classification (SVC), SVR is very good at estimating the raw value profiles of the samples. The successful application of the SVR approach in this study reinforces the fact that support vector regression is a powerful tool for extracting the protein sequence-structure relationship and for estimating protein structural profiles from amino acid sequences.
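
The two evaluation measures quoted in the abstract, the Pearson correlation coefficient (CC) and the root mean square error (RMSE) between predicted and observed values, can be computed as in the sketch below. It uses the textbook definitions; whether the paper normalizes RWCO values before scoring is not stated in the abstract, and the function names are chosen for this example only.

```python
import math

def pearson_cc(pred, obs):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(pred)
    mean_p = sum(pred) / n
    mean_o = sum(obs) / n
    cov = sum((p - mean_p) * (o - mean_o) for p, o in zip(pred, obs))
    sd_p = math.sqrt(sum((p - mean_p) ** 2 for p in pred))
    sd_o = math.sqrt(sum((o - mean_o) ** 2 for o in obs))
    return cov / (sd_p * sd_o)

def rmse(pred, obs):
    """Root mean square error between predicted and observed values."""
    return math.sqrt(sum((p - o) ** 2 for p, o in zip(pred, obs)) / len(pred))
```

The two measures capture different things: predictions that are perfectly correlated with the observations (CC of 1) can still have a large RMSE if they are systematically scaled or shifted, which is why the abstract reports both.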

Queensland University of Technology ePrints Archive

University of Queensland eSpace

PlasmoDraft: a database of Plasmodium falciparum gene function predictions based on postgenomic data

Author: A Gasch
A Mateos
A Vazquez
C Brun
D LaCount
D Lockhart
E Dahl
E Marcotte
E Pizzi
E Sonnhammer
G Yona
J Dougherty
J Sachs
J Shock
J Young
Jean-François Dufayard
K Le Roch
L Dice
L Florens
L Wu
Laurent Bréhélin
M Gardner
M Llinas
MB Eisen
MPS Brown
MR Chmielewski
O Bastion
Olivier Gascuel
P Langley
P Toronen
PT Spellman
S Altschul
T Hastie
Y Chen
Y Zhou
Z Bozdech
Publication venue: BioMed Central
Publication date: 01/01/2008

BACKGROUND: Of the 5,484 predicted proteins of Plasmodium falciparum, the main causative agent of malaria, about 60% do not have sufficient sequence similarity with proteins in other organisms to warrant provision of functional assignments. Non-homology methods are thus needed to obtain functional clues for these uncharacterized genes. RESULTS: We present PlasmoDraft (http://atgc.lirmm.fr/PlasmoDraft/), a database of Gene Ontology (GO) annotation predictions for P. falciparum genes based on postgenomic data. Predictions of PlasmoDraft are achieved with a Guilt By Association method named Gonna. This involves (1) a predictor that proposes GO annotations for a gene based on the similarity of its profile (measured with transcriptome, proteome or interactome data) with genes already annotated by GeneDB; (2) a procedure that estimates the confidence of the predictions achieved with each data source; (3) a procedure that combines all data sources to provide a global summary and confidence estimate of the predictions. Gonna has been applied to all P. falciparum genes using most publicly available transcriptome, proteome and interactome data sources. Gonna provides predictions for numerous genes without any annotations. For example, 2,434 genes without any annotations in the Biological Process ontology are associated with specific GO terms (e.g. Rosetting, Antigenic variation), and among these, 841 have confidence values above 50%. In the Cellular Component and Molecular Function ontologies, 1,905 and 1,540 uncharacterized genes are associated with specific GO terms, respectively (740 and 329 with confidence value above 50%). CONCLUSION: All predictions along with their confidence values have been compiled in PlasmoDraft, which thus provides an extensive database of GO annotation predictions that can be achieved with these data sources. The database can be accessed in different ways. A global view allows for a quick inspection of the GO terms that are predicted with high confidence, depending on the various data sources. A gene view and a GO term view allow for the search of potential GO terms attached to a given gene, and genes that potentially belong to a given GO term.

Would the field of cognitive neuroscience be advanced by sharing functional MRI data?

Author: A Bischoff-Grethe
AR Laird
B Nelson
BB Biswal
D Tomasi
Daniel H Weissman
DC Van Essen
DD Cox
FA Nielsen
J Carp
J Derrfuss
J Dickson
J Jonides
JD Haynes
JD Van Horn
JL Teeters
JM Chein
JT Serences
JV Haxby
KA Norman
KD Fitzgerald
Kristina M Visscher
L Maccotta
M Greicius
MM Botvinick
MPS Brown
National Institutes of Health
Organisation for Economic Co-operation and Development
PT Fox
R Cabeza
SM Smith
T Yarkoni
TD Wager
Y Kamitani
Y Liu
Publication venue: BioMed Central
Publication date: 01/01/2011

During the past two decades, the advent of functional magnetic resonance imaging (fMRI) has fundamentally changed our understanding of brain-behavior relationships. However, the data from any one study add only incrementally to the big picture. This fact raises important questions about the dominant practice of performing studies in isolation. To what extent are the findings from any single study reproducible? Are researchers who lack the resources to conduct an fMRI study being needlessly excluded? Are pre-existing fMRI data being used effectively to train new students in the field? Here, we will argue that greater sharing and synthesis of raw fMRI data among researchers would make the answers to all of these questions more favorable to scientific discovery than they are today and that such sharing is an important next step for advancing the field of cognitive neuroscience.

Deep Blue Documents at the University of Michigan

Missing value imputation for microarray gene expression data using histone acetylation information

Author: AA Alizadeh
AL Clayton
AP Gasch
C Rich
Caisheng He
CM Perou
D Schubeler
DE Koryakov
DJ Duggan
DK Pokholok
E Segal
GC Yuan
GCLY Yuan
H Kim
H Yoshimoto
HY Yu
I Takemasa
J Tuikkala
JA Orr
Jiang Wang
Jihua Feng
JJ Hu
JL DeRisi
JL Schafer
KJ Kim
KW McCool
L Mariño-Ramírez
L Narlikar
L Verdone
M Ouyang
MB Eisen
MD Meneghini
MPS Brown
MS Kobor
MSB Sehgal
O Alter
O Troyanskaya
OJ Rando
P Johansson
P Spellman
Qian Xiang
RJA Little
S Chatterjee
S Oba
S Raychaudhuri
SA Armstrong
SC Kim
SK Kurdistani
TR Golub
TR O'Connor
TY Roh
X Feng
X Guo
Xianhua Dai
Yangyang Deng
Zhiming Dai
Publication venue: BioMed Central
Publication date: 01/01/2008

BACKGROUND: Accurately estimating missing values in microarray data is an important pre-processing step, because complete datasets are required in numerous expression profile analyses in bioinformatics. Although several methods have been suggested, their performance is not satisfactory for datasets with high missing percentages. RESULTS: The paper explores the feasibility of missing value imputation with the help of gene regulatory mechanisms. An imputation framework called histone acetylation information aided imputation method (HAIimpute method) is presented. It incorporates histone acetylation information into the conventional KNN (k-nearest neighbor) and LLS (local least squares) imputation algorithms for final prediction of the missing values. The experimental results indicated that the use of acetylation information can provide significant improvements in microarray imputation accuracy. The HAIimpute methods consistently improve on widely used methods such as KNN and LLS in terms of normalized root mean squared error (NRMSE). Meanwhile, the genes imputed by HAIimpute methods are more correlated with the original complete genes in terms of Pearson correlation coefficients. Furthermore, the proposed methods also outperform GOimpute, one of the existing related methods that use functional similarity as external information. CONCLUSION: We demonstrated that the use of histone acetylation information could greatly improve the performance of imputation, especially at high missing percentages. This idea can be generalized to various imputation methods to improve their performance. Moreover, as more knowledge accumulates on gene regulatory mechanisms in addition to histone acetylation, the performance of our approach can be further improved and verified.
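
For orientation, the conventional KNN imputation baseline and the NRMSE measure named in the abstract can be sketched as below. This is a generic illustration only: it does not use histone acetylation information, so it is not HAIimpute; the inverse-distance weighting and the NRMSE definition (RMSE divided by the standard deviation of the true values) are common choices assumed here, not details taken from the paper. The names `knn_impute` and `nrmse` are hypothetical.

```python
import math

def knn_impute(matrix, row, col, k=2):
    """Estimate the missing entry matrix[row][col] (stored as None).

    Rows are genes and columns are arrays. Distance between two genes is
    the root mean squared difference over columns observed in both.
    The estimate is the inverse-distance-weighted mean of the k nearest
    genes' values in the target column.
    """
    target = matrix[row]
    candidates = []
    for i, other in enumerate(matrix):
        if i == row or other[col] is None:
            continue
        shared = [j for j in range(len(target))
                  if j != col and target[j] is not None and other[j] is not None]
        if not shared:
            continue
        dist = math.sqrt(sum((target[j] - other[j]) ** 2 for j in shared) / len(shared))
        candidates.append((dist, other[col]))
    if not candidates:
        raise ValueError("no usable neighbor rows")
    nearest = sorted(candidates)[:k]
    weights = [1.0 / (d + 1e-9) for d, _ in nearest]  # small offset avoids division by zero
    return sum(w * v for w, (_, v) in zip(weights, nearest)) / sum(weights)

def nrmse(imputed, true):
    """RMSE of the imputed values divided by the standard deviation of the true values."""
    n = len(true)
    mean_true = sum(true) / n
    var_true = sum((t - mean_true) ** 2 for t in true) / n
    mse = sum((a - t) ** 2 for a, t in zip(imputed, true)) / n
    return math.sqrt(mse / var_true)
```

An NRMSE of 0 means the imputed values match the held-out true values exactly, while a value near 1 is roughly what imputing every entry with the mean of the true values would give.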

A new method for class prediction based on signed-rank algorithms applied to Affymetrix® microarray experiments

Author: A Holleman
A Ploner
AA Alizadeh
AB Olshen
Aurélien Vassal
Bernard Klein
BM Bolstad
C Lottaz
D Jelinek
Dirk Hose
DJ Lockhart
F Zhan
G Russo
G Wright
H Zhang
HA Kestler
Hartmut Goldschmidt
IS Lossos
J De Vos
J Khan
J Moreaux
J Quackenbush
JN McClintick
John De Vos
JP Vert
K Kadota
K Mahtouk
K Tarte
KJ Savage
L Bullinger
M Nelson
M Reimers
MA Shipp
MPS Brown
Pierre-Olivier Poulain
PJ Valk
R Hoffmann
R Simon
R Tibshirani
RO Duda
S Datta
S Michiels
S Paik
S Rao
S Wold
SO Zakharkin
SS Dave
T Nilsson
Thierry Rème
TR Golub
Véronique Pantesco
WM Liu
X Huang
Y Tu
Y Vasconcelos
Publication venue: BioMed Central
Publication date: 01/01/2008

BACKGROUND: The huge amount of data generated by DNA chips is a powerful basis to classify various pathologies. However, constant evolution of microarray technology makes it difficult to mix data from different chip types for class prediction of limited sample populations. Affymetrix® technology provides both a quantitative fluorescence signal and a decision (detection call: absent or present) based on signed-rank algorithms applied to several hybridization repeats of each gene, with a per-chip normalization. We developed a new class prediction method based only on the detection call from recent Affymetrix chip types. Biological data were obtained by hybridization on U133A, U133B and U133Plus 2.0 microarrays of purified normal B cells and cells from three independent groups of multiple myeloma (MM) patients. RESULTS: After a call-based data reduction step to filter out non class-discriminative probe sets, the gene list obtained was reduced to a predictor with correction for multiple testing by iterative deletion of probe sets that sequentially improve inter-class comparisons and their significance. The error rate of the method was determined using leave-one-out and 5-fold cross-validation. It was successfully applied to (i) determine a sex predictor with the normal donor group classifying gender with no error in all patient groups except for male MM samples with a Y chromosome deletion, (ii) predict the immunoglobulin light and heavy chains expressed by the malignant myeloma clones of the validation group and (iii) predict sex, light and heavy chain nature for every new patient. Finally, this method was shown to be powerful when compared to the popular classification method Prediction Analysis of Microarray (PAM). CONCLUSION: This normalization-free method is routinely used for quality control and correction of collection errors in patient reports to clinicians. It can be easily extended to multiple class prediction suited to clinical groups, and looks particularly promising through international cooperative projects like the "Microarray Quality Control project of US FDA" (MAQC) as a predictive classifier for diagnosis, prognosis and response to treatment. Finally, it can be used as a powerful tool to mine published data generated on Affymetrix systems and more generally classify samples with binary feature values.

HAL-Inserm